{
 "metadata": {
  "name": "",
  "signature": "sha256:da79d5a34aa681df25cb0948ed59d95ce3881d04a9c74ad86ddbff3217b0e448"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Chapter 2: First steps into text processing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The previous chapter has hopefully whet your appetite. In this chapter we will focus on one of the most important tasks in Humanities research: text processing. One of the goals of text processing is to clean up your data as a pre-step to some kind of data analysis. Another common goal is to convert a given text collection to a different format. In this chapter we will provide you with the necessary tools to work with collections of texts, clean them and perform some rudimentary data analyses on them."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Reading files"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Say you have a text stored on your computer. How can we read that text using Python? Python provides a really simple function called `open` with which we can read texts. In the folder `data` you find a couple of small text excerpts that we will use in this chapter. Go ahead and have a look at them. We can open these files with the following command:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "infile = open('data/austen-emma-excerpt.txt')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now print `infile`. What do you think that will happen?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(infile)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\"Hey! That's not what I expected to happen!\", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `data/austen-emma-excerpt.txt`. In order to *read* the contents of the file we must add the function `read` as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(infile.read())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "`read` is a function that operates on `TextWrapper` objects and allows us to read the contents of a file into Python. Let's assign the contents of the file to the variable `text`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "infile = open('data/austen-emma-excerpt.txt')\n",
      "text = infile.read()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The variable `text` now holds the contents of the file `data/austen-emma-excerpt.txt` and we can access and manipulate it just like any other string. After we read the contents of a file, the `TextWrapper` no longer needs to be open. In fact, it is good practice to close it as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "infile.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Just to recap some of the stuff we learnt in the previous chapter. Can you write code that defines the variable `number_of_es` and counts how many times the letter *e* occurs in `text`? (Tip: use a `for` loop and an `if` statement)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "number_of_es = 0\n",
      "# insert your code here\n",
      "\n",
      "# The following test should print True if your code is correct \n",
      "print(number_of_es == 78)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Writing our first function"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the previous quiz, you probably wrote a loop that iterates over all characters in `text` and adds 1 to `number_of_es` each time the program finds the letter *e*. Counting objects in a text is a very common thing to do. Therefore, Python provides the convenient function `count`. This function operates on strings (`somestring.count(argument)`) and takes as argument the object you want to count. Using this function, the solution to the quiz above can now be rewritten as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "number_of_es = text.count(\"e\")\n",
      "print(number_of_es)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In fact, `count` takes as argument any string you would like to find. We could just as well count how often the determiner `an` occurs:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(text.count(\"an\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The string `an` is found 12 times in our text. Does that mean that the word *an* occurs 12 times in our text? Go ahead. Count it yourself. In fact, *an* occurs only twice... Think about this. Why does Python print 12?\n",
      "\n",
      "If we want to count how often the word *an* occurs in the text and not the string `an`, we could surround *an* with spaces, like the following:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(text.count(\" an \"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Although it gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of *an* followed by a semicolon or some end-of-sentence marker? Then we would need to query the text multiple times for each possible context of *an*. For that reason, we're going to approach the problem using a different, more sophisticated strategy. \n",
      "\n",
      "Recall from the previous chapter the function `split`. What does this function do? The function `split` operates on a string and splits a string on spaces and returns a list of smaller strings (or words):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(text.split())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "All the things you have learnt so far should enable you to write code that counts how often a certain items occurs in a list. Write some code that defines the variable `number_of_hits` and counts how often the word *in* (assigned to `item_to_count`) occurs in the the list of words called `words`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words = text.split()\n",
      "number_of_hits = 0\n",
      "item_to_count = \"in\"\n",
      "# insert your code here\n",
      "\n",
      "# The following test should print True if your code is correct \n",
      "print(number_of_hits == 3)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We will go through the previous quiz step by step. We would like to know how often the preposition *in* occurs in our text. As a first step we will split the string `text` into a list of words:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "words = text.split()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we define a variable `number_of_hits` and set it to zero."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "number_of_hits = 0"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The final step is to loop over all words in `words` and add 1 to `number_of_ins` if we find a word that is equal to `in`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "item_to_count = \"in\"\n",
      "for word in words:\n",
      "    if word == item_to_count:\n",
      "        number_of_hits += 1\n",
      "print(number_of_hits)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, say we would like to know how often the word *of* occurs in our text. We could adapt the previous lines of code to search for the word *of*, but what if we also would like to count the number of times *the* occurs, and *house* and *had* and... It would be really cumbersome to repeat all these lines of code for each particular search term we have. Programming is supposed to reduce our workload, not increase it. Just like the function `count` for strings, we would like to have a function that operates on lists, takes as argument the object we would like to count and returns the number of times this object occurs in our list.\n",
      "\n",
      "In this and the previous chapter you have already seen lots of functions. A function does something, often based on some argument you pass to it, and generally returns a result. You are not just limited to using functions in the standard library but you can write your own functions.\n",
      "\n",
      "In fact, you *must* write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. Functions are defined using the `def` keyword, they take a name and optionally a number of parameters. \n",
      "\n",
      "    def some_name(optional_parameters):\n",
      "\n",
      "The `return` statement returns a value back to the caller and always ends the execution of the function. \n",
      "\n",
      "Going back to our problem, we want to write a function called `count_in_list`. It takes two arguments: (1) the object we would like to count and (2) the list in which we want to count that object. Let's write down the function definition in Python:\n",
      "\n",
      "    def count_in_list(item_to_count, list_to_search):\n",
      "    \n",
      "Do you understand all the syntax and keywords in the definition above? Now all we need to do is to add the lines of code we wrote before to the body of this function:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def count_in_list(item_to_count, list_to_search): \n",
      "    number_of_hits = 0                            \n",
      "    for item in list_to_search:                   \n",
      "        if item == item_to_count:                 \n",
      "            number_of_hits += 1                   \n",
      "    return number_of_hits                         "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "All code should be familiar to you, except the `return` keyword. The `return` keyword is there to tell python to return as a result of calling the function the argument `number_of_hits`. OK, let's go through our function one more time, just to make sure you really understand all of it.\n",
      "\n",
      "1. First we define a function using `def` and give it the name `count_in_list` (line 1);\n",
      "2. This function takes two arguments: `item_to_count` and `list_to_search` (line 1);\n",
      "3. Within the function, we define a variable `number_of_hits` and assign to it the value zero (since at that stage we haven't found anything yet (line 2));\n",
      "4. We loop over all words in `list_to_search` (line 3);\n",
      "5. If we find a word that is equal to `item_to_count` (line 4), we add 1 to `number_of_hits` (line 5);\n",
      "6. Return the result of `number_of_hits` (line 6).\n",
      "\n",
      "Let's test our little function! We will first count how often the word *an* occurs in our list of words `words`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(count_in_list(\"an\", words))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Using the function we defined, print how often the word *the* occurs in our text"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# insert your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "A more general count function"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our function `count_in_list` is a concise and convenient piece of code allowing us to rapidly and without too much repitition count how often certain items occur in a given list. Now what if we would like to find out for all words in our text how often they occur. Then it would be still quite cumbersome to call our function for each unique word. We would like to have a function that takes as argument a particular list and counts for each unique item in that list how often it occurs. There are multiple ways of writing such a function. We will show you two ways of doing it."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "A count function (take 1)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the previous chapter you have acquainted yourself with the `dictionary` structure. Recall that a dictionary consists of keys and values and allows you to quickly lookup a value. We will use a dictionary to write the function `counter` that takes as argument a list and returns a `dictionary` with `keys` for each unique item and `values` showing the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.\n",
      "\n",
      "We start with defining a variable `counts` which is an empty dictionary:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "counts = {}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we will loop over all words in our list `words`. For each word, we check whether the dictionary already contains it. If so, we add 1 to its value. If not, we add the word to the dictionary and assign to it the value 1."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for word in words:\n",
      "    if word in counts:\n",
      "        counts[word] = counts[word] + 1\n",
      "    else:\n",
      "        counts[word] = 1\n",
      "print(counts)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you don't remember anymore how dictionaries work, go back to the previous chapter and read the part about dictionaries once more.\n",
      "\n",
      "Now that our code is working, we can add it to a function. We define the function `counter` using the `def` keyword. It takes one argument (`list_to_search`)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def counter(list_to_search):                 \n",
      "    counts = {}                              \n",
      "    for word in list_to_search:              \n",
      "        if word in counts:                   \n",
      "            counts[word] = counts[word] + 1  \n",
      "        else:                                \n",
      "            counts[word] = 1                 \n",
      "    return counts                            "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Hopefully we are boring you, but let's go through this function step by step.\n",
      "\n",
      "1. We define a function using `def` and give it the name `counter` (line 1);\n",
      "2. This function takes a single argument `list_to_search` which is the list we want to search through (line 1);\n",
      "3. Next we define a variable `counts` which is an empty dictionary (line 2);\n",
      "4. We loop over all words in `list_to_search` (line 3);\n",
      "5. If the word is already in `counts`, we look up its current value and add 1 to it (line 4-5);\n",
      "6. If the word is not in `counts` (else clause), we add the word to the dictionary and assign it the value 1 (line 6-7);\n",
      "7. Return the result of counts (line 8);\n",
      "\n",
      "Let's try out our new function!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(counter(words))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's put some of the stuff we learnt so far together. What we want you to do is to read into Python the file `data/austen-emma.txt`, convert it to a list of words and assign to the variable `emma_count` how often the word *Emma* occurs in the text."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "emma_count = 0\n",
      "# insert you code here\n",
      "\n",
      "# The following test should print True if your code is correct \n",
      "print(emma_count == 481)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "A count function (take 2)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's train our function writing skills a little more. We are going to write another counting function, this time using a slightly different strategy. Recall our function `count_in_list`. It takes as argument a list and the item we want to count in that list. It returns the number of times this item occurs in the list. If we call this function for each unique word in `words`, we obtain a list of frequencies, quite similar to the one we get from the function `counter`. What would happen if we just call the function `count_in_list` on each word in `words`? "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "infile = open('data/austen-emma-excerpt.txt')\n",
      "text = infile.read()\n",
      "infile.close()\n",
      "words = text.split()\n",
      "\n",
      "for word in words:\n",
      "    print(word, count_in_list(word, words))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As you can see, we obtain the frequency of each word token in `words`, where we would like to have it only for unique word forms. The challenge is thus to come up with a way to convert our list of words into a structure with solely unique words. For this Python provides a convenient data structure called `set`. It takes as argument some iterable (e.g. a list) and returns a new object containing only unique items:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']\n",
      "unique_x = set(x)\n",
      "print(unique_x)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Using `set` we can iterate over all unique words in our word list and print the corresponding frequency:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "unique_words = set(words)\n",
      "for word in unique_words:\n",
      "    print(word, count_in_list(word, words))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We wrap the lines of code above into the function `counter2`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def counter2(list_to_search):\n",
      "    unique_words = set(list_to_search)\n",
      "    for word in unique_words:\n",
      "        print(word, count_in_list(word, list_to_search))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A final check to see whether our function behaves correctly:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "counter2(words)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We have written two functions `counter` and `counter2`, both used to count for each unique item in a particular list how often it occurs in that list. Can you come up with some pros and cons for each function? Why is `counter2` better than `counter` or why is `counter` better than `counter2`?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*Double click this cell and write down your answer.*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Text clean up"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The function `split` is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite an amount of noise. For instance, the pronoun *her* occurs 4 times, but we also find `her.` occurring 1 time and the capitalized `Her`, also 1 time. Of course we would like to add those counts to that of *her*. As it appears, the tokenization of our text using `split` is fast and simple, but it leaves us with noisy and incorrect frequency distributions. \n",
      "\n",
      "There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure of splitting our text into words. The second is to clean-up our text and pass this clean result to the convenient `split` function. For now we will follow the second path.\n",
      "\n",
      "Some words in our text are capitalized. To lowercase these words, Python provides the function `lower`. It operates on strings:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x = 'Emma'\n",
      "x_lower = x.lower()\n",
      "print(x_lower)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can apply this function to our complete text to obtain a completely lowercased text, using:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text_lower = text.lower()\n",
      "print(text_lower)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The function `replace` is just the function we're looking for. It takes two arguments: (1) the string we would like to replace and (2) the string we want to replace the first argument with:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x = 'Please. remove. all. dots. from. this. sentence.'\n",
      "x = x.replace(\".\", \"\")\n",
      "print(x)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice that we replace all dots with an empty string written as `\"\"`. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write code that to lowercase and remove all commas in the following short text:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "short_text = \"Commas, as it turns out, are so much overestimated.\"\n",
      "# insert your code here\n",
      "\n",
      "# The following test should print True if your code is correct \n",
      "print(short_text == \"commas as it turns out are so much overestimated.\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We would like to remove all punctuation from a text, not just dots and commas. We will write a function called `remove_punc` that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call `replace` on the same string each time replacing a different punctuation character with an empty string. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def remove_punc(text):\n",
      "    punctuation = '!@#$%^&*()_-+={}[]:;\"\\'|<>,.?/~`'\n",
      "    for marker in punctuation:\n",
      "        text = text.replace(marker, \"\")\n",
      "    return text\n",
      "\n",
      "short_text = \"Commas, as it turns out, are overestimated. Dots, however, even more so!\"\n",
      "print(remove_punc(short_text))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The second strategy we will follow is to show you that we can achieve the same result without using the built in function `replace`. Remember that a string consists of characters. We can loop over a string accessing each character in turn. Each time we find a punctuation marker we skip to the next character."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def remove_punc2(text):\n",
      "    punctuation = '!@#$%^&*()_-+={}[]:;\"\\'|<>,.?/~`'\n",
      "    clean_text = \"\"\n",
      "    for character in text:\n",
      "        if character not in punctuation:\n",
      "            clean_text += character\n",
      "    return clean_text\n",
      "\n",
      "short_text = \"Commas, as it turns out, are overestimated. Dots, however, even more so!\"\n",
      "print(remove_punc2(short_text))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "1) Can you come up with any pros or cons for each of the two functions above?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*Write your answer here* (double click me)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "2) Now it is time to put everything together. We want to write a function `clean_text` that takes as argument a text represented by string. The function should return this string with all punctuation removed and all characters lowercased."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def clean_text(text):\n",
      "    # insert your code here\n",
      "    \n",
      "# The following test should print True if your code is correct \n",
      "short_text = \"Commas, as it turns out, are overestimated. Dots, however, even more so!\"\n",
      "print(clean_text(short_text) == \n",
      "      \"commas as it turns out are overestimated dots however even more so\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "3) This last excercise puts everything together. We want you to open and read the file `data/austen-emma.txt` text once more, clean up the text and recompute the frequency distribution. Assign to `woodhouse_counts` the number of times the name *Woodhouse* occurs in the text."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "woodhouse_counts = 0\n",
      "# insert your code here\n",
      "\n",
      "# The following test should print True if your code is correct \n",
      "print(woodhouse_counts == 263)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Writing results to a file"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We have accomplished a lot! You have learnt how to read files using Python from your computer, how to manipulate them, clean them up and compute a frequency distribution of the words in a text file. We will finish this chapter with explaining to you how to write your results to a file. We have already seen how to read a text from our disk. Writing to our disk is only slightly different. The following lines of code write a single sentence to the file `first-output.txt`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "outfile = open(\"first-output.txt\", mode=\"w\")\n",
      "outfile.write(\"My first output.\")\n",
      "outfile.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Go ahead and open the file `first-output.txt` located in the folder where this course resides. As you can see it contains the line `My first output.`. To write something to a file we open, just as in the case of reading a file, a `TextIOWrapper` which can be seen as a connection to the file `first-output.txt`. The difference with opening a file for reading is the *mode* with which we open the connection. Here the mode says `w`, meaning \"open the file for writing\". To open a file for reading, we set the mode to `r`. However, since this is Python's default setting, we may omit it."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Quiz!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the final quiz of this chapter we will ask you to write the frequency distribution over the words in `data/austen-emma.txt` to the file `data/austen-frequency-distribution.txt`. We will give you some code to get you started"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# first open and read data/austen-emma.txt. Don't forget to close the infile\n",
      "infile = open(\"data/austen-emma.txt\")\n",
      "text = # read the contents of the infile\n",
      "# close the file handler\n",
      "# clean the text\n",
      "\n",
      "# next compute the frequency distribution using the function counter\n",
      "frequency_distribution = \n",
      "\n",
      "# now open the file data/austen-frequency-distribution.txt for writing\n",
      "outfile = \n",
      "\n",
      "for word, frequency in frequency_distribution.items():\n",
      "    outfile.write(word + \";\" + str(frequency) + '\\n')\n",
      "    \n",
      "# close the outfile"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Ignore the following, it's just here to make the page pretty:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.core.display import HTML\n",
      "def css_styling():\n",
      "    styles = open(\"styles/custom.css\", \"r\").read()\n",
      "    return HTML(styles)\n",
      "css_styling()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "/*\n",
        "Placeholder for custom user CSS\n",
        "\n",
        "mainly to be overridden in profile/static/custom/custom.css\n",
        "\n",
        "This will always be an empty file in IPython\n",
        "*/\n",
        "<style>\n",
        "    @import url(http://fonts.googleapis.com/css?family=Roboto:400,300,300italic,400italic,700,700italic);\n",
        "\n",
        "    div.cell{\n",
        "        font-family:'roboto','helvetica','sans';\n",
        "        color:#444;\n",
        "        width:800px;\n",
        "        margin-left:16% !important;\n",
        "        margin-right:auto;\n",
        "    }\n",
        "\n",
        "    div.text_cell_render{\n",
        "        font-family: 'roboto','helvetica','sans';\n",
        "        line-height: 145%;\n",
        "        font-size: 120%;\n",
        "        color:#444;\n",
        "        width:800px;\n",
        "        margin-left:auto;\n",
        "        margin-right:auto;\n",
        "    }\n",
        "    .CodeMirror{\n",
        "            font-family: \"Menlo\", source-code-pro,Consolas, monospace;\n",
        "    }\n",
        "    .prompt{\n",
        "        display: None;\n",
        "    }    \n",
        "    .warning{\n",
        "        color: rgb( 240, 20, 20 )\n",
        "        }  \n",
        "</style>\n",
        "<script>\n",
        "    MathJax.Hub.Config({\n",
        "                        TeX: {\n",
        "                           extensions: [\"AMSmath.js\"]\n",
        "                           },\n",
        "                tex2jax: {\n",
        "                    inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
        "                    displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
        "                },\n",
        "                displayAlign: 'center', // Change this to 'center' to center equations.\n",
        "                \"HTML-CSS\": {\n",
        "                    styles: {'.MathJax_Display': {\"margin\": 4}}\n",
        "                }\n",
        "        });\n",
        "</script>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "<IPython.core.display.HTML at 0x1036b05f8>"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p><small><a rel=\"license\" href=\"http://creativecommons.org/licenses/by-sa/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-sa/4.0/88x31.png\" /></a><br /><span xmlns:dct=\"http://purl.org/dc/terms/\" property=\"dct:title\">Python Programming for the Humanities</span> by <a xmlns:cc=\"http://creativecommons.org/ns#\" href=\"http://fbkarsdorp.github.io/python-course\" property=\"cc:attributionName\" rel=\"cc:attributionURL\">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-sa/4.0/\">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct=\"http://purl.org/dc/terms/\" href=\"https://github.com/fbkarsdorp/python-course\" rel=\"dct:source\">https://github.com/fbkarsdorp/python-course</a>.</small></p>"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}